Skip to content

[BEAM-3525] Fix serialization of PipelineOptions in TestPipeline#run()#4478

Closed
pgerv12 wants to merge 1 commit into
apache:masterfrom
pgerv12:remove-premature-serialization
Closed

[BEAM-3525] Fix serialization of PipelineOptions in TestPipeline#run()#4478
pgerv12 wants to merge 1 commit into
apache:masterfrom
pgerv12:remove-premature-serialization

Conversation

@pgerv12

@pgerv12 pgerv12 commented Jan 24, 2018

Copy link
Copy Markdown
Contributor

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

This PR removes the serialization of PipelineOptions prematurely before the TestPipeline is run (which creates a runner instance, which in turn uses options to run the Beam application). Since this serialization occurs before the selected runner has the chance to use/look at them, any options marked with JsonIgnore do not survive -- leading to undesired behavior since serialization is really only needed between job creation/submission vs runtime.

@pgerv12

pgerv12 commented Jan 24, 2018

Copy link
Copy Markdown
Contributor Author

+reviewer: @lukecwik

@pgerv12 pgerv12 changed the title [BEAM-3525] Remove unnecessary serialization of PipelineOptions in TestPipeline in run() [BEAM-3525] Fix serialization of PipelineOptions in TestPipeline in run() Jan 24, 2018
@pgerv12 pgerv12 changed the title [BEAM-3525] Fix serialization of PipelineOptions in TestPipeline in run() [BEAM-3525] Fix serialization of PipelineOptions in TestPipeline#run() Jan 24, 2018
@robertwb

Copy link
Copy Markdown
Contributor

It seems good in general that TestPipeline tests that the serialization of pipeline options works. In particular, this may be needed if we want to pass options through the Runner API to the actual runner.

@pgerv12

pgerv12 commented Jan 24, 2018

Copy link
Copy Markdown
Contributor Author

Ok, then we need a way for tests to bypass this so specified options aren't dropped.

@kennknowles

Copy link
Copy Markdown
Member

In the portable future, options are allowed/expected to be serialized on their way to a runner. So that means @JsonIgnore options are just a way to use PipelineOptions for Beam-irrelevant command line parsing. We shouldn't really encourage that, since mature/stanardized command line libraries are more appropriate.

@pgerv12

pgerv12 commented Jan 24, 2018

Copy link
Copy Markdown
Contributor Author

@kennknowles Ah, ok. Thank you for the explanation!

@lukecwik

Copy link
Copy Markdown
Member

Serialization of pipeline options should only matter between SDK and runner like Kenn says and not between test infrastructure and SDK.

I'm all for having a good way to do args parsing and dependency passing outside of PipelineOptions, I just don't see that being feasible while being backwards compatible. Also using two argument parsing systems is typically awkward so I can't see a migration away from using PipelineOptions, only a full swap to something else.

@lukecwik lukecwik left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test so that this doesn't regress.

@kennknowles

Copy link
Copy Markdown
Member

What I mean is that just using PipelineOptions to configure a user's main() program is not great.

The PipelineOptions absolutely should go through a serialization round trip before they reach the runner, since this is to spec. We should fix anything that breaks because of it.

@kennknowles

Copy link
Copy Markdown
Member

@pgerv12 can you describe the particular problem that happened? I do think that existing runners may have a lot of dependencies on non-serializable options just for configuration. But for a non-Java SDKs the options have to come serialized, so we need to fix the runners to have all of their configuration specified in ways that can be sent that way.

@pgerv12

pgerv12 commented Jan 24, 2018

Copy link
Copy Markdown
Contributor Author

@kennknowles The Streams Runner has a --jarsToStage option which it uses to add dependent jars to its application bundle (marked with JsonIgnore). When testing Beam 2.2, our IT is failing because Classes could not be found because the jars weren't present. Looking at the reason, our runner never received the String passed in through the beamTestPipelineOptions system property variable. The TestPipeline properly parses the variable and sets its member options correctly with #testingPipelineOptions. During the run phase, this serialization occurs making jarsToStage null, this then gets to our runner and doesn't add any jars cause none are specified.

I can see where the non-Java SDK need arises and it is problematic since this is purely Java SDK-based. A proper solution sounds like there needs to be two types of serialization: one for complete PipelineOptions transfer used for testing/allowing runners to get all configuration parameters (used for test infrastructure -> SDK -> Runner translation), and the second is the FasterXML Jackson serialization (Runner translation -> Runner runtime).

For reference, it sounds like JsonIgnore should differentiate between whether a DoFn needs it during execution or not (I understand this may be old documentation): https://beam.apache.org/documentation/sdks/javadoc/2.2.0/org/apache/beam/sdk/options/PipelineOptions.html

@kennknowles

Copy link
Copy Markdown
Member

I see. This makes a lot of sense, and should be supported until we have a complete portability story.

For reference, in the portable model the Java SDK harness will execute the user's UDFs, so it would own options like this. Artifact staging goes over the "Artifact API" (just a proxy for staging via a runner-specific mechanism) and the Java construction side and Java execution side use it however they want, and just have to agree. So then @JsonIgnore is fine for files to stage again.

For your case, now, can you just make the option not @JsonIgnore? If that is a problem, them I'm OK with this change but we should develop a proper plan for TestPipeline to function appropriately for both styles of runner.

@pgerv12

pgerv12 commented Jan 24, 2018

Copy link
Copy Markdown
Contributor Author

Yes, our workaround it to remove the annotations and things work well with the change. I'm glad we're having this discussion because I didn't know what the proper behavior should be.

@lukecwik

Copy link
Copy Markdown
Member

jarsToStage sounds a lot like filesToStage. Note that filesToStage is not marked with @JsonIgnore for Flink/Spark/Dataflow.

Dropping the @JsonIgnore seems like the right approach but also consider migrating to use filesToStage and the same jar detection code which Flink/Spark/Dataflow all share as can be seen here:

if (flinkOptions.getFilesToStage() == null) {

@pgerv12

pgerv12 commented Jan 25, 2018

Copy link
Copy Markdown
Contributor Author

Is it the right approach? My argument is that filesToStage should be marked with @JsonIgnore since it's not really needed at the worker side. If that is or isn't the case, looking at the FlinkPipelineOptions, getStateBackend() is marked @JsonIgnore and it sounds like if I wanted to run tests with a different backend then that wouldn't be doable from my test parameters because of this problem (or if I used a non-Java SDK, I could never be able to set the backend through parameters since my arguments would always be serialized away).

Edit: I understand for portability that serialization need to occur so I would say if the goal is portability then the @JsonIgnore should disappear entirely.

@robertwb

robertwb commented Jan 25, 2018 via email

Copy link
Copy Markdown
Contributor

@pgerv12 pgerv12 closed this Jan 25, 2018
@lukecwik

Copy link
Copy Markdown
Member

There are reasons why a user my actually want to know what files were staged and this could be one way for them to get it. The original reason why it was marked with @JsonIgnore was because of a test infrastructure limitation in Dataflow 1.x. I would prefer to have as few options as possible marked with @JsonIgnore so users don't need to treat PipelineOptions differently during construction vs execution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants